Index Compression and Redundancy Elimination in Large Textual Collections
Author
Abstract
Similar resources
Analysis of Lossless Reversible Transformation Algorithms to Enhance Data Compression
In this paper we analyze and present the benefits offered in lossless compression by applying a choice of preprocessing methods that exploit the redundancy of the source file. Textual data holds a number of properties that can be taken into account in order to improve compression. Preprocessing copes with these properties by applying a number of transformations that make th...
Fast Relative Lempel-Ziv Self-index for Similar Sequences
Recent advances in biotechnology and web technology are generating huge collections of similar strings. People now face the problem of storing them compactly while supporting fast pattern searching. One compression scheme called relative Lempel-Ziv compression uses textual substitutions from a reference text as follows: Given a (large) set S of strings, represent each string in S as a concatena...
Redundancy Elimination Within Large Collections of Files
Ongoing advancements in technology lead to ever-increasing storage capacities. In spite of this, optimizing storage usage can still provide rich dividends. Several techniques based on delta-encoding and duplicate block suppression have been shown to reduce storage overheads, with varying requirements for resources such as computation and memory. We propose a new scheme for storage reduction that...
Leveraging naturally distributed data redundancy to optimize collective replication
Dumping large amounts of related data simultaneously to local storage devices instead of a parallel file system is a frequent I/O pattern of HPC applications running at large scale. Since local storage resources are prone to failures and have limited potential to serve multiple requests in parallel, techniques such as replication are often used to enable resilience and high availability. Howeve...
Using the Web to Reduce Data Sparseness in Pattern-Based Information Extraction
Textual patterns have been used effectively to extract information from large text collections. However, they rely heavily on textual redundancy, in the sense that facts have to be mentioned in a similar manner in order to be generalized to a textual pattern. Data sparseness thus becomes a problem when trying to extract information from hardly redundant sources like corporate intranets, encyclope...